score distribution
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.74)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Appendix A: Additional results. This appendix section shows additional results and corresponding plots to support the insights ...
Section A.2 shows results using a chat-style verbalized numeric ... Section A.3 shows results on four extra benchmark tasks made available with ... Finally, Section A.5 presents and discusses results on feature ... In this section, we evaluate risk score calibration on the income prediction task across different subpopulations, as is typically done as part of a fairness audit. Figures A1-A2 show group-conditional calibration curves for all models on the ACSIncome task, evaluated on three subgroups specified by the race attribute in the ACS data. We show the three race categories with the largest representation. The 'Mixtral 8x22B' and 'Yi 34B' models shown are the worst offenders, where samples belonging to the 'Black' population see consistently lower scores for the same positive label probability when compared to the 'Asian' or 'White' populations. On average, the 'Mixtral 8x22B (it)' model classifies a Black individual with a ... In fact, this score bias can be reversed for some base models, which overestimate scores for Black individuals compared with other subgroups.
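As a rough illustration of what a group-conditional calibration curve measures, the sketch below bins risk scores separately for each subgroup and compares the mean score in each bin with the observed positive rate. It is not the procedure used in the paper: the function name `group_calibration_curves`, the equal-width binning, and the toy inputs are all assumptions made for this example.

```python
import numpy as np

def group_calibration_curves(scores, labels, groups, n_bins=10):
    """Hypothetical helper: per-group calibration curves with equal-width bins.

    Returns {group: (mean_score_per_bin, positive_rate_per_bin)}, skipping
    bins that contain no samples for that group.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    groups = np.asarray(groups)
    edges = np.linspace(0.0, 1.0, n_bins + 1)

    curves = {}
    for g in np.unique(groups):
        mask = groups == g
        s, y = scores[mask], labels[mask]
        # Inner edges only, so bin indices fall in [0, n_bins - 1]
        # and a score of exactly 1.0 lands in the last bin.
        idx = np.digitize(s, edges[1:-1])
        mean_score, pos_rate = [], []
        for b in range(n_bins):
            in_bin = idx == b
            if in_bin.any():
                mean_score.append(s[in_bin].mean())
                pos_rate.append(y[in_bin].mean())
        curves[g.item()] = (np.array(mean_score), np.array(pos_rate))
    return curves

# Toy data (made up): two groups with identical outcomes but shifted scores.
curves = group_calibration_curves(
    scores=[0.2, 0.2, 0.8, 0.8, 0.1, 0.1, 0.6, 0.6],
    labels=[0, 0, 1, 1, 0, 0, 1, 1],
    groups=["A", "A", "A", "A", "B", "B", "B", "B"],
    n_bins=2,
)
```

A group whose curve sits above another's at the same score is being assigned lower scores for the same positive label probability, which is the kind of gap the passage describes.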
- Oceania > New Zealand (0.04)
- North America > United States > California (0.04)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
- (6 more...)
- Research Report > New Finding (0.92)
- Questionnaire & Opinion Survey (0.68)
- Government (0.92)
- Education (0.70)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.98)
- (2 more...)
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- (9 more...)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Netherlands (0.04)
- North America > United States > California (0.04)
- North America > Canada (0.04)
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
- Media (1.00)
- Leisure & Entertainment (0.93)
- Information Technology > Security & Privacy (0.71)
- Law (0.70)
Better Correlation and Robustness: A Distribution-Balanced Self-Supervised Learning Framework for Automatic Dialogue Evaluation
Turn-level dialogue evaluation models (TDEMs) that use a self-supervised learning (SSL) framework have achieved state-of-the-art performance in open-domain dialogue evaluation. However, these models inevitably face two potential problems. First, they correlate poorly with human judgments on medium-coherence samples, because the SSL framework often produces training data with an unbalanced coherence distribution. Second, the SSL framework leads TDEMs to a nonuniform score distribution. Our theoretical analysis indicates that this nonuniform score distribution risks weakening the robustness of TDEMs.